

SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

Neural Information Processing Systems

While self-attention is used across a wide variety of tasks, its cost is quadratic in the input length, which makes long inputs difficult to handle. In this paper, we present a method for accelerating and structuring self-attention: Sparse Adaptive Connection (SAC). In SAC, we regard the input sequence as a graph, and attention operations are performed only between linked nodes. In contrast with previous self-attention models with pre-defined structures (edges), the model learns to construct attention edges to improve task-specific performance. In this way, the model is able to select the most salient nodes and reduce the quadratic complexity regardless of the sequence length. Based on SAC, we show that previous variants of self-attention models are its special cases. Through extensive experiments on neural machine translation, language modeling, graph representation learning and image classification, we demonstrate that SAC is competitive with state-of-the-art models while significantly reducing memory cost.
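The idea of attending only along graph edges can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sparse_attention` and the convention that an edge `(i, j)` means "node i attends to node j" are assumptions made for illustration, and the edge list is passed in by hand, whereas SAC learns which edges to construct. The point is only that the number of score computations scales with the number of edges rather than with the square of the sequence length.

```python
import math

def sparse_attention(queries, keys, values, edges):
    """Scaled dot-product attention restricted to a given edge list.

    Hypothetical helper for illustration. `edges` holds (i, j) pairs meaning
    node i attends to node j; in SAC these would be predicted by the model,
    here they are simply supplied by the caller.
    """
    n = len(queries)
    d = len(queries[0])
    dim_v = len(values[0])

    # Group the target nodes of each source node.
    neighbors = {i: [] for i in range(n)}
    for i, j in edges:
        neighbors[i].append(j)

    outputs = []
    for i in range(n):
        js = neighbors[i]
        if not js:
            # Assumption: a node with no outgoing edges gets a zero vector.
            outputs.append([0.0] * dim_v)
            continue
        # One score per edge, so the total work is O(|edges|), not O(n^2).
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d)
            for j in js
        ]
        # Numerically stable softmax over the linked nodes only.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * values[j][c] for w, j in zip(weights, js)) for c in range(dim_v)]
        )
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    # Node 0 attends to node 1; node 2 attends to nodes 0 and 1; node 1 has no edges.
    print(sparse_attention(q, k, v, [(0, 1), (2, 0), (2, 1)]))
```

With a fully connected edge list this reduces to ordinary dense self-attention, which is consistent with the abstract's claim that earlier fixed-structure variants are special cases of choosing the edges.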



Review for NeurIPS paper: SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection


Weaknesses: My main concern is the computational cost of the proposed method. The method requires running an LSTM on each token at every layer (or even every head) sequentially. Compared to the parallel processing of Transformers, I would expect this sequential computation to be quite slow. All those factors should affect the computation speed in a negative way. Given that computational efficiency is the goal of the paper, the authors must discuss them in detail.


Review for NeurIPS paper: SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection


This paper addresses the quadratic bottleneck in the Transformer architecture. It proposes a Sparse Adaptive Connection (SAC) model that learns to predict sparse connections (attention links) between inputs, and attention is performed only on those predicted links. The proposed method is competitive with state-of-the-art models on WMT, language modeling and image classification tasks while significantly reducing memory cost. Overall, three of the four reviewers seem to have liked the paper, although they had some concerns (below), while one reviewer (R3) recommended a weak reject. A weakness pointed out by R2 and R3 is that only accuracy is reported and speed is not, which seems necessary to support the title "Accelerating Self-Attention". The authors promised to add more details about computational efficiency and memory cost in the final version, and I urge them to do so.

